Abstract

Given a nonlinear model, a probabilistic forecast may be obtained by Monte Carlo simulations. At a given forecast horizon, Monte Carlo simulations yield sets of discrete forecasts, which can be converted to density forecasts. The resulting density forecasts will inevitably be downgraded by model mis-specification. In order to enhance the quality of the density forecasts, one can mix them with the unconditional density. This paper examines the value of combining conditional density forecasts with the unconditional density. The findings have positive implications for issuing early warnings in different disciplines including economics and meteorology, but UK inflation forecasts are considered as an example.

1 Introduction

Forecasts of a given quantity of interest often come from multiple sources. For instance, UK inflation forecasts are issued by both the Bank of England (BOE) and the National Institute of Economic and Social Research (NIESR) (Hall and Mitchell, 2007). In fact, the BOE has a suite of forecasting models which may be used to inform policy (Kapetanios et al., 2008). Instead of determining the best model from multiple sources, it may be better to combine the models in some way (Bunn, 1989; Clemen, 1989; Granger and Ramanathan, 1984); but combining forecasts raises questions about a suitable criterion for assessing the quality of the composite forecast.

Combining forecasts has generally been viewed as a way of pooling information sources together (Bates and Granger, 1969; Granger and Ramanathan, 1984; Bunn, 1989; Wallis, 2005). Recently, model mis-specification has been given as another reason why pooling forecasts may be necessary (Hendry and Clements, 2004; Kapetanios et al., 2008). Hendry and Clements (2004) went further to suggest that pooling forecasts together may be viewed as a way of applying the James-Stein 'shrinkage' estimation (James and Stein, 1961). Founded on multiple parameter estimation problems (James and Stein, 1961; Casella, 1985), shrinkage estimates are obtained by mixing the maximum likelihood estimates with the 'grand average' (Efron and Morris, 1977). The grand average is an average of all available data (or estimates). Shrinkage estimators are readily applicable to combining point forecasts (e.g. Greis and Gilstein, 1991). Moreover, point forecasts have dominated the discussion on forecast combination (Wallis, 2005).

Abbreviations: Probability Integral Transform (PIT), Bank of England (BOE), Finite Unconditional Forecaster (FUF)

Hitherto, a notable departure from point forecasts to combining density forecasts was a discussion by Hall and Mitchell (2007), who also attributed earlier success of combined forecasts to model mis-specification. Considering UK inflation forecasts, their study confirmed that combining density forecasts outperforms the individual forecasts. Sources of the forecasts they considered were the BOE and NIESR. They combined these models with what they called 'time series density' (Hall and Mitchell, 2007). Their 'time series density' was a probability density function estimated from available data. The time series density may be considered an estimate of the unconditional density.

Here, we examine the quality of mixture distributions of conditional density forecasts and the unconditional density in light of the goal of probabilistic forecasting set forth by Gneiting et al. (2007). The goal involves two concepts: calibration and sharpness. Calibration is the statistical consistency between forecast probabilities and observed relative frequencies (Brier, 1950; Gneiting et al., 2007; Gneiting, 2008) while sharpness is a measure of how concentrated probabilistic forecasts are and a property of the forecasts only (Bross, 1953; Gneiting, 2008; Gneiting et al., 2007; Wilks, 2006). Calibration was also termed validity by Bross (1953) and reliability by Saunders (1958). Currently it is commonly known as calibration (e.g. Gneiting, 2008; Lawrence et al., 2006), although the weather community also uses the term reliability. While much of the discussion on calibration of probabilistic forecasts has centred on categorical events, Dawid (1984) is notable for proposing the use of probability integral transforms (PITs) to assess the calibration of density forecasts. A PIT is obtained by plugging an observation into the cumulative predictive distribution function. His proposed test included the additional condition that the PITs should be independent and identically distributed. Diebold et al. (1998) then showed that if density forecasts coincide with the ideal forecasts, then the PITs are independent and identically uniformly distributed (iid U[0, 1]).

The proposal of Diebold et al. (1998) to use PITs was motivated by their quest for a universally applicable approach to density forecast evaluation. They argued that scoring rules cannot rank incorrect density forecasts in a way that satisfies all users. Recent work (Gneiting and Raftery, 2007; Bröcker and Smith, 2007) has discussed essential properties for scoring rules, but it is still unclear whether such can yield a consistent ranking of non-ideal forecasts. On the other hand, testing whether PITs are iid U[0, 1] is only sufficient to determine if the forecasts are ideal or not. It is of no value in providing a universal ranking of non-ideal forecasts.

If the iid condition is relaxed, then PITs can be uniform even when the forecasts are not ideal. Gneiting et al. (2007) termed this scenario probabilistic calibration. They introduced two other modes of calibration: exceedance calibration and marginal calibration. Marginal calibration refers to the case when the time average of all predictive distributions is equal to that of ideal forecasts. Since the time average of ideal forecasts can be estimated from time series, marginal calibration can be empirically assessed. There is no empirical way of assessing exceedance calibration and we will defer its definition until section 2.3.

Gneiting et al. (2007) then conjectured that when a subset of these modes of calibration holds, then the predictive distributions are at least as spread out as the ideal forecasts, which conjecture they termed a sharpness principle. The aim was to ensure that predictive distributions were no more confident than the ideal forecasts. It has further been argued that the goal of probabilistic forecasting is to maximise sharpness subject to calibration (Gneiting, 2008; Gneiting et al., 2007). This so-called paradigm (Gneiting, 2008) depends on the aforementioned conjecture, which we shall revisit later. If one could identify relevant modes of calibration for the conjecture to hold, then the sharpness of predictive distributions could be maximised subject to those modes to achieve the goal. Maximising sharpness is equivalent to minimising uncertainty.

This paper presents a new theoretical analysis of the quality of density forecasts in terms of sharpness and calibration. It focuses upon combining conditional forecasts with an unconditional estimate; this may be viewed as 'shrinkage' of conditional forecasts towards the unconditional distribution (Hendry and Clements, 2004). Hall and Mitchell (2007) found including the unconditional distribution to improve predictive distributions; but merely including the unconditional distribution cannot improve sharpness. Therefore, in addition to the mixture parameter, we suggest scaling the dispersion of conditional forecasts. Our analysis answers the appeal of Hall and Mitchell (2007) for more theoretical work on combining density forecasts. Empirical results are given on UK inflation forecasts using the same example considered by Hall and Mitchell (2007).

The next section discusses forecast qualities that are cumulatively measured by the logarithmic scoring rule. In particular, a decomposition of this scoring rule is presented. The sharpness principle conjectured by Gneiting et al. (2007) is discussed in § 3 and a relevant proposition presented. In § 4 the methodology employed to produce density forecasts and theoretical analyses of forecast combinations are given. Results concerning density forecasts obtained via the logarithmic scoring rule with respect to the BOE inflation forecasts are presented in § 5. Section 6 gives a discussion and concluding remarks. Appendices A and B contain the proof of the proposition concerning the sharpness principle and appendix C proofs for the rest of the propositions. Appendix D gives a complementary discussion of point forecasts.



2 Probabilistic-Forecast Quality 



Model mis-specification places limitations on the value of probabilistic forecasts. On the other hand, consumers of forecasts may demand predictive distributions that are both calibrated and sharp. If such forecasts are issued at long time horizons, then early warning is afforded. We suggest that these qualities can be cumulatively quantified by the logarithmic scoring rule proposed by Good (1952). There are other scoring rules available for selection (see Gneiting and Raftery, 2007). For instance, there is the Brier score (Brier, 1950). This, however, decomposes into many terms (Murphy, 1993), some of which are not relevant to our discussion, and it is suited to categorical events. A generalisation of the Brier score to density forecasts is the continuous ranked probability score (Gneiting and Raftery, 2007), but it lacks a clear interpretation. Indeed traditional decompositions of scoring rules do not contain a sharpness term. There is also the mean square error loss function (Corradi and Swanson, 2006), which is also irrelevant to the qualities of interest. Suffice it to say, the logarithmic scoring rule is preferred over others for its appeal to information theory concepts (see Roulston and Smith, 2002), which can be traced back to Shannon (1948, 1949). Information theory has a strong hold on uncertainty, a concept equivalent to sharpness.

2.1 Logarithmic Scoring Rule

Consider a density forecast $f(x)$ and a target probability density function $g(x)$. If we think of $X$ as a random variable, then the foregoing notation says that the true distribution of $X$ is $g(x)$. With this notation, the information based scoring rule used in this paper is

$$E[\mathrm{IGN}(f,X)] = -\int_{-\infty}^{\infty} g(x)\log f(x)\,dx, \tag{1}$$

where $\mathrm{IGN}(f,X) = -\log f(X)$, proposed by Good (1952) and termed Ignorance in Roulston and Smith (2002) and predictive deviance in Knorr-Held and Rainer (2001). Hence, (1) is the expected Ignorance. It is related to the Kullback-Leibler divergence (Kullback and Leibler, 1951),

$$D_{KL}(g\|f) = \int_{-\infty}^{\infty} g(x)\log\left(\frac{g(x)}{f(x)}\right)dx,$$

by

$$D_{KL}(g\|f) = E[\mathrm{IGN}(f,X)] + \int_{-\infty}^{\infty} g(x)\log g(x)\,dx.$$

It follows that the $f$ that minimises $D_{KL}(g\|f)$ also minimises $E[\mathrm{IGN}(f,X)]$. The expected Ignorance is the cross entropy $H(g,f)$. The Ignorance score is especially relevant when one evaluates the performance of density forecasts given time series only, with no access to $g(x)$. An important property of the Ignorance score is that it attains the minimum if and only if $f(x) = g(x)$ (Bröcker and Smith, 2008; Gneiting and Raftery, 2007), meaning it is strictly proper.
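The relation between expected Ignorance, the Kullback-Leibler divergence and the entropy of the target can be checked numerically. The sketch below is illustrative only: the two Gaussian densities, the integration limits and the helper names (`integrate`, `expected_ignorance`, `kl_divergence`) are assumptions made for the example and are not taken from the paper.

```python
import math

def normal_pdf(x, mean, sd):
    """Gaussian density; used here only as a convenient test case."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def integrate(func, lo, hi, n=20000):
    """Composite trapezoidal rule on [lo, hi] with n sub-intervals."""
    step = (hi - lo) / n
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, n):
        total += func(lo + i * step)
    return total * step

def expected_ignorance(g, f, lo, hi):
    """E[IGN(f, X)] = -int g log f, i.e. equation (1)."""
    return -integrate(lambda x: g(x) * math.log(f(x)), lo, hi)

def kl_divergence(g, f, lo, hi):
    """D_KL(g || f) = int g log(g / f)."""
    return integrate(lambda x: g(x) * math.log(g(x) / f(x)), lo, hi)

g = lambda x: normal_pdf(x, 0.0, 1.0)   # hypothetical target density
f = lambda x: normal_pdf(x, 0.5, 1.5)   # hypothetical mis-specified forecast

LO, HI = -12.0, 12.0                    # wide enough for both tails
ign_f = expected_ignorance(g, f, LO, HI)
ign_g = expected_ignorance(g, g, LO, HI)   # this is the entropy of g
kl = kl_divergence(g, f, LO, HI)

# E[IGN(f, X)] = D_KL(g || f) + H(g): the forecast f = g attains the minimum.
print(ign_f, kl + ign_g)
```

Because the divergence is non-negative, the printed expected Ignorance of `f` can never fall below that of the ideal forecast `g`, which is the strict propriety property in numerical form.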

Traditionally, the only score that has been decomposed into constituent terms is the Brier score: the reliability-resolution decomposition (Murphy, 1993; Wilks, 2006), after removing the uncertainty term. Bröcker (2009) extended the decomposition to general scores, but in the context of categorical forecasts. Unlike sharpness, resolution is not a property of the forecasts only. Therefore, we introduce a decomposition of (1) as

$$E[\mathrm{IGN}(f,X)] = -\int_{-\infty}^{\infty} f(x)\log f(x)\,dx - \int_{-\infty}^{\infty} [g(x) - f(x)]\log f(x)\,dx.$$

In this decomposition of expected Ignorance, the first term is sharpness and the second is calibration. Notice that the sharpness term is simply the density entropy $H(f)$, a property of the density forecast only. It is desirable for this term to be as negative as possible, effectively expressing more certainty about what is likely to happen. Since calibration is a statistical property of the forecasting system, it cannot be assessed based on one forecast only. For a time series of forecasts, we want each $f(x)$ to be close to $g(x)$ in some way. One is never furnished with $g(x)$ to aid assessment of calibration in an operational setup, but there are time series approaches to address this.
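As a short worked check (a sketch that uses only the definitions already given, with $H(g)$ and $H(f)$ denoting the entropies of $g$ and $f$), the two terms recombine into (1), and the calibration term can be rewritten through the Kullback-Leibler divergence:

```latex
\begin{align*}
-\int f\log f\,dx \;-\; \int (g - f)\log f\,dx
  &= -\int g\log f\,dx \;=\; E[\mathrm{IGN}(f,X)],\\
-\int (g - f)\log f\,dx
  &= E[\mathrm{IGN}(f,X)] - H(f)
   \;=\; D_{KL}(g\|f) + H(g) - H(f).
\end{align*}
```

In particular, the calibration term vanishes when $f = g$, since then $D_{KL}(g\|f) = 0$ and $H(g) = H(f)$.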



2.2 Sharpness 



One way to quantify sharpness is to use the variance (e.g. Gneiting et al., 2007). We emphasise that sharpness should be quantified by entropy, which "is a measure of concentration" of the distribution "on a set of small measure", a small value of entropy corresponding to a "high degree of concentration" (Hirschman, 1957). The entropy of a distribution $f(x)$ of variance $\sigma^2$ satisfies the inequality (Shannon, 1948, 1949)

$$-\int f(x)\log f(x)\,dx \le \frac{1}{2}\log\left(2\pi e\sigma^2\right).$$

Hence, a small variance forces the entropy to be low, but a low entropy does not require a small variance. Indeed two distributions with the same variances can have unequal entropies. For instance, a mixture of two Gaussians will have lower entropy than a single Gaussian distribution of the same variance. Moreover, a distribution of a higher variance can have a lower entropy than one of lower variance.
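Both claims can be illustrated numerically. The following sketch assumes a hypothetical equal-weight mixture of two well-separated Gaussians (component means of -2 and 2, standard deviation 0.5) and compares its entropy with the closed-form entropy of single Gaussians; the mixture and the integration grid are illustrative choices only.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def entropy(pdf, lo, hi, n=40000):
    """-int p log p by the trapezoidal rule, treating 0 log 0 as 0."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        p = pdf(lo + i * step)
        if p > 0.0:
            weight = 0.5 if i in (0, n) else 1.0
            total -= weight * p * math.log(p)
    return total * step

def gaussian_entropy(variance):
    """Closed-form entropy of a Gaussian, the upper bound in the inequality above."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def bimodal(x):
    """Hypothetical mixture: two narrow Gaussians centred at -2 and +2."""
    return 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)

var_mix = 0.5 ** 2 + 2.0 ** 2            # within-component + between-component variance
h_mix = entropy(bimodal, -15.0, 15.0)
h_same_var = gaussian_entropy(var_mix)   # Gaussian with the same variance (4.25)
h_smaller_var = gaussian_entropy(2.0)    # Gaussian with a smaller variance (2.0)

print(h_mix, h_smaller_var, h_same_var)
```

The mixture has the larger variance of the two comparisons in the last line, yet its entropy is the smallest of the three values, so ranking by variance and ranking by entropy disagree here.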



Sharpness has also been quantified by confidence intervals (Raftery et al., 2005; Gneiting et al., 2007). Confidence intervals share a similar weakness to variance in the sense that a bimodal distribution that is fairly concentrated on the two modes can have larger confidence intervals than a unimodal distribution that is fairly spread out. Also, given two non-symmetric distributions, which of them is deemed sharper could depend on what the confidence level is.

2.3 Calibration 

The calibration of density forecasts is a well trodden subject. Much of the literature takes the stand that a calibrated forecasting system is tantamount to a correctly specified model. Corradi and Swanson (2006) provide a comprehensive survey of formal statistical techniques for assessing calibration of density forecasts to determine if the underlying model is correctly specified. The work of Gneiting et al. (2007) strikes a discord by providing a calibration framework that accommodates model mis-specification. They broke down calibration into three modes, each of which could be assessed separately.



Suppose a probability forecasting system issues predictive distributions $\{F_t(x)\}_{t=1}^{T}$ while the data-generating process issues ideal forecasts $\{G_t(x)\}_{t=1}^{T}$. Gneiting et al. (2007) then defined the following modes of calibration:

The sequence $\{F_t(x)\}_{t=1}^{T}$ is probabilistically calibrated relative to $\{G_t(x)\}_{t=1}^{T}$ if

$$\frac{1}{T}\sum_{t=1}^{T} G_t\{F_t^{-1}(p)\} = p, \quad p \in (0,1). \tag{2}$$

The sequence $\{F_t(x)\}_{t=1}^{T}$ is exceedance calibrated relative to $\{G_t(x)\}_{t=1}^{T}$ if

$$\frac{1}{T}\sum_{t=1}^{T} G_t^{-1}\{F_t(x)\} = x, \quad x \in \mathbb{R}. \tag{3}$$

The forecaster is marginally calibrated if

$$\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} G_t(x) = \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} F_t(x).$$

Note that the definitions of probabilistic and exceedance calibration require the distributions to be strictly increasing over all $\mathbb{R}$. In the subsequent discussions, we admit cases where the distributions simply have compact support. In such cases, the inverses will only be taken within the regions of compact support. If we have a time series of observations $x_t$, then $z_t = F_t(x_t)$ is a probability integral transform (PIT) (Corradi and Swanson, 2006; Diebold et al., 1998). Uniformity of the PITs is equivalent to probabilistic calibration (Gneiting et al., 2007). A visual inspection of PIT histograms would reveal obvious departures from uniformity. The underlying model is correctly specified if and only if $z_t \sim$ iid $U[0,1]$.
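A PIT histogram is simple to compute once the predictive distribution functions are available. The sketch below assumes Gaussian predictive distributions and a synthetic data-generating process, both chosen purely for illustration; `pit_values` and `pit_histogram` are hypothetical helper names.

```python
import math
import random

def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def pit_values(observations, means, sds):
    """z_t = F_t(x_t) for Gaussian predictive distributions F_t."""
    return [normal_cdf(x, m, s) for x, m, s in zip(observations, means, sds)]

def pit_histogram(pits, bins=10):
    """Relative frequency of PITs in equal-width bins on [0, 1]."""
    counts = [0] * bins
    for z in pits:
        counts[min(int(z * bins), bins - 1)] += 1
    return [c / len(pits) for c in counts]

rng = random.Random(0)                      # fixed seed for repeatability
T = 5000
means = [rng.uniform(-1.0, 1.0) for _ in range(T)]
obs = [rng.gauss(m, 1.0) for m in means]    # synthetic truth: N(mean_t, 1)

ideal = pit_histogram(pit_values(obs, means, [1.0] * T))
overconfident = pit_histogram(pit_values(obs, means, [0.5] * T))

print(ideal)          # roughly flat, near 0.1 per bin
print(overconfident)  # U-shaped: too many PITs near 0 and 1
```

A flat histogram is what probabilistic calibration looks like in practice; the U shape of the second histogram is the usual signature of predictive distributions that are too narrow.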

Suppose we have a time series of density forecasts, $\{f_t(x)\}_{t\ge 1}$. Then define the forecaster's unconditional density as

$$\bar{p}_T(x) = \frac{1}{T}\sum_{t=1}^{T} f_t(x).$$

We define a forecaster who issues the finite time unconditional distribution,

$$F_t(x) = \bar{G}_T(x) = \frac{1}{T}\sum_{k=1}^{T} G_k(x),$$

for all $t \in \{1,\ldots,T\}$, to be the finite unconditional forecaster (FUF). If $F_t(x) = \lim_{T\to\infty}\bar{G}_T(x)$, then we have the unconditional forecaster¹. A forecaster is finite marginally calibrated if $\bar{p}_T(x) = \bar{G}_T'(x)$.

For all practical purposes, $T$ is finite and we have no access to the $G_t(x)$'s. Hence it is difficult to assess finite marginal calibration. If $d(\bar{p}_T, \bar{p}_{2T}) \approx 0$, where $d$ is some metric, then we can take $T$ to be large enough to evaluate marginal calibration. To this end, we can use the Hellinger distance (Pollard, 2002) and compute

$$h(\bar{p}_T, p_u) = \frac{1}{2}\int \left(\sqrt{\bar{p}_T(x)} - \sqrt{p_u(x)}\right)^2 dx,$$

where $p_u(x) = \lim_{T\to\infty}\bar{G}_T'(x)$ is the underlying system's unconditional density. It is useful to note that $0 \le h(\cdot,\cdot) \le 1$, assuming the value of 0 when the two distributions are identical and 1 when they do not overlap. This procedure for assessing marginal calibration is an alternative to the graphical tests performed in Gneiting et al. (2007). It is expected to be more robust to finite sample effects.
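A minimal numerical sketch of such a distance follows. It assumes the normalisation in which the distance is half the integrated squared difference of the root densities, which is one form consistent with the stated range of 0 to 1; the uniform test densities and the function names are hypothetical.

```python
import math

def hellinger(p, q, lo, hi, n=20000):
    """0.5 * int (sqrt(p) - sqrt(q))**2 dx on [lo, hi], by the trapezoidal rule."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * (math.sqrt(p(x)) - math.sqrt(q(x))) ** 2
    return 0.5 * total * step

def uniform_pdf(a, b):
    """Density of U[a, b], returned as a function of x."""
    return lambda x: 1.0 / (b - a) if a <= x <= b else 0.0

u01 = uniform_pdf(0.0, 1.0)
identical = hellinger(u01, u01, -1.0, 4.0)                       # same density
disjoint = hellinger(u01, uniform_pdf(2.0, 3.0), -1.0, 4.0)      # no overlap
half_overlap = hellinger(u01, uniform_pdf(0.5, 1.5), -1.0, 4.0)  # overlap on [0.5, 1]

print(identical, half_overlap, disjoint)
```

In an application, `p` and `q` would be replaced by estimates of the forecaster's unconditional density over windows of length $T$ and $2T$, or by the forecaster's and the system's unconditional densities.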



3 The Sharpness Principle and Early Warning 



Murphy and Wilks (1998) highlighted that forecasts need to be calibrated before one worries about sharpness. Recently, Gneiting et al. (2007) adopted a paradigm of maximising sharpness subject to calibration. They then conjectured that the goal of obtaining ideal forecasts and that of maximising sharpness subject to calibration are equivalent, which is the sharpness principle. To revisit the conjecture, denote the variance of a random variable whose distribution is $F(x)$ by $\mathrm{var}(F)$. Then Gneiting et al. (2007) define a forecaster to be at least as spread out as the ideal forecaster if the inequality

$$\frac{1}{T}\sum_{t=1}^{T}\mathrm{var}(F_t) \ge \frac{1}{T}\sum_{t=1}^{T}\mathrm{var}(G_t) \tag{4}$$

holds. With this notion of spread, a weaker alternative states that any sufficiently calibrated forecaster is at least as spread out as the ideal forecaster (Gneiting et al., 2007). The key word here is "sufficiently". Maintaining our reservations about using variance to quantify sharpness, this section is concerned with addressing this weaker conjecture.

¹ Gneiting et al. (2007) refer to this as the climatological forecaster, even though this may have nothing to do with climate in a meteorological sense.

3.1 Counter Examples



None of the individual modes of calibration alone is sufficient for the weaker conjecture to hold (Gneiting et al., 2007). In this subsection, we present relevant counter examples.

Probabilistically calibrated forecaster: Let $G_t = U[0,1]$, $t = 1,2$, be ideal forecasts and the corresponding forecaster

$$F_t(x) = \begin{cases}
0, & x < 0,\\
x/\{2\beta^{2-t}(1-\beta)^{t-1}\}, & x \in [0, \beta^{2-t}(1-\beta)^{t-1}],\\
\tfrac{1}{2} + \{x - \beta^{2-t}(1-\beta)^{t-1}\}/\{2\beta^{t-1}(1-\beta)^{2-t}\}, & x \in [\beta^{2-t}(1-\beta)^{t-1}, 1],\\
1, & x > 1
\end{cases}$$

for $t = 1,2$ with $0 < \beta < 1/2$. This forecaster satisfies equation (2), hence is probabilistically calibrated. Note that when $\beta = 1/3$, the average variance of the forecaster distributions is 37/432 whilst the ideal forecaster yields 1/12. Hence, inequality (4) is satisfied. If entropy is used to measure sharpness, then $H(f_t) = (1/2)\log(4\beta(1-\beta))$ for $t = 1,2$, where $H(f_t)$ denotes the density entropy of $f_t$ and $f_t(x) = F_t'(x)$. Since $H(g_t) = 0$ and $4\beta(1-\beta) < 1$, $H(f_t) < H(g_t)$. Thus entropy indicates that the forecaster is sharper than the ideal forecaster. For distributions with the same compact support, it is well known in information theory that the uniform distribution yields maximum uncertainty, a fact that is missed by using variance.
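The numbers quoted for this example can be reproduced exactly with rational arithmetic. The sketch below assumes the piecewise-linear form in which $F_t$ rises to 1/2 at the knot $b_t = \beta^{2-t}(1-\beta)^{t-1}$ and to 1 at $x = 1$; the helper names are hypothetical and the moment formulas are derived for that assumed form.

```python
from fractions import Fraction
import math

def knot(beta, t):
    """b_t = beta^(2-t) * (1-beta)^(t-1): beta for t = 1 and 1 - beta for t = 2."""
    return beta ** (2 - t) * (1 - beta) ** (t - 1)

def variance(b):
    """Variance of the density 1/(2b) on [0, b] and 1/(2(1-b)) on [b, 1]."""
    mean = (1 + 2 * b) / 4
    second_moment = (1 + b + 2 * b * b) / 6
    return second_moment - mean * mean

def entropy(b):
    """Differential entropy of the same density: 0.5 * log(4 b (1 - b))."""
    b = float(b)
    return 0.5 * math.log(2.0 * b) + 0.5 * math.log(2.0 * (1.0 - b))

def averaged_calibration(beta, p):
    """(1/2) * sum_t G_t(F_t^{-1}(p)) with G_t = U[0, 1], so that G_t(x) = x."""
    total = Fraction(0)
    for t in (1, 2):
        b = knot(beta, t)
        if p <= Fraction(1, 2):
            total += 2 * b * p
        else:
            total += b + 2 * (1 - b) * (p - Fraction(1, 2))
    return total / 2

beta = Fraction(1, 3)
mean_var = sum(variance(knot(beta, t)) for t in (1, 2)) / 2

print(mean_var, Fraction(1, 12))        # average forecaster variance vs ideal
print(entropy(knot(beta, 1)))           # negative, so below H(g_t) = 0
print(averaged_calibration(beta, Fraction(7, 10)))
```

The average variance exceeds 1/12 while the entropy is negative, which is the disagreement between the two sharpness measures described in the text.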

(2010) gives an example in which (4) is violated. He takes

$$G_1(x) = 1 - 2e^{-\lambda x} + e^{-2\lambda x}, \quad G_2(x) = 1 - e^{-2\theta x}, \quad F_1(x) = 1 - e^{-\lambda x}, \quad F_2(x) = 1 - e^{-\theta x},$$

supported on $(0,\infty)$, with $\theta^2 > 3\lambda^2$. Here, $\mathrm{var}(F_1) = 1/\lambda^2$, $\mathrm{var}(F_2) = 1/\theta^2$, $\mathrm{var}(G_1) = 5/(4\lambda^2)$ and $\mathrm{var}(G_2) = 1/(4\theta^2)$. Hence (4) does not hold, yet the forecaster is probabilistically calibrated.
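Both properties of this example can be verified directly from the closed forms. The sketch below uses hypothetical parameter values $\lambda = 1$ and $\theta = 2$, which satisfy $\theta^2 > 3\lambda^2$; the function names are illustrative.

```python
import math

def averaged_calibration(lam, theta, p):
    """(1/2) * [G_1(F_1^{-1}(p)) + G_2(F_2^{-1}(p))] for the exponential example."""
    x1 = -math.log(1.0 - p) / lam      # F_1^{-1}(p)
    x2 = -math.log(1.0 - p) / theta    # F_2^{-1}(p)
    g1 = 1.0 - 2.0 * math.exp(-lam * x1) + math.exp(-2.0 * lam * x1)
    g2 = 1.0 - math.exp(-2.0 * theta * x2)
    return 0.5 * (g1 + g2)

def mean_variances(lam, theta):
    """Average variances of the forecaster and of the ideal forecasts."""
    forecaster = 0.5 * (1.0 / lam ** 2 + 1.0 / theta ** 2)
    ideal = 0.5 * (5.0 / (4.0 * lam ** 2) + 1.0 / (4.0 * theta ** 2))
    return forecaster, ideal

LAM, THETA = 1.0, 2.0    # hypothetical values with THETA**2 > 3 * LAM**2

var_forecaster, var_ideal = mean_variances(LAM, THETA)
print([averaged_calibration(LAM, THETA, p) for p in (0.1, 0.5, 0.9)])
print(var_forecaster, var_ideal)
```

The first line reproduces the probabilities 0.1, 0.5 and 0.9, as required by (2), while the forecaster's average variance is smaller than that of the ideal forecasts, so (4) fails.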

Exceedance calibrated forecaster: Now let $G_t = U[0,1]$, $t = 1,2$, be the ideal forecaster whose corresponding forecaster is

$$F_t = U\left[\frac{t-1}{2}, \frac{t}{2}\right], \quad t = 1,2.$$

This forecaster is exceedance calibrated since

$$\frac{1}{2}\sum_{t=1}^{2} G_t^{-1}\{F_t(x)\} = \begin{cases}
\tfrac{1}{2}(2x + 0), & x \in [0,1/2),\\
\tfrac{1}{2}\{1 + (2x - 1)\}, & x \in [1/2,1],
\end{cases} \;=\; x, \quad x \in (0,1),$$

yet (4) is violated because the average variance of the forecaster is 1/48 (< 1/12). We also have $H(f_t) = -\log 2$ ($< H(g_t)$). Both measures concur that the forecaster is sharper than the ideal forecaster.

Marginally calibrated forecaster: Suppose $G_t = U[0,1]$, $t = 1, 2, \ldots$, is a sequence of ideal forecasts and suppose that a forecaster issues

$$F_t = U[(k_t - 1)/n, \; k_t/n]$$

for some finite $n > 1$, where $k_t$ is a discrete uniform random variable taking values in $\{1, 2, \ldots, n\}$. Clearly, the unconditional distribution is $U[0,1]$. The forecaster's unconditional distribution is given by

$$\bar{F}(x) = \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} F_t(x).$$

Note that $F_t(x)$ is a function of a random variable due to its dependence on $k_t$. It turns out that the expectation of $F_t$ is $U[0,1]$. Hence, by the law of large numbers, $\bar{F} = U[0,1]$. Hence the forecaster issuing $F_t$ is marginally calibrated. However, both entropy and variance indicate that the forecaster is sharper than the ideal forecaster.

² Prof T. Gneiting brought this example to my attention through private communication.



3.2 Calibration Theorem 

Since none of the individual modes of calibration is sufficient for the forecaster to be less sharp than the ideal forecaster (Gneiting et al., 2007), we ought to determine if any two would suffice. Since Pal (2009) did not satisfactorily address this (Pal, 2010), it is revisited. In order to address this conjecture, we define finite marginal calibration as

$$\frac{1}{T}\sum_{t=1}^{T} F_t(x) = \frac{1}{T}\sum_{t=1}^{T} G_t(x). \tag{5}$$

From a practical point of view, probabilistic and marginal calibration are more important than exceedance calibration because they can be assessed empirically.

PROPOSITION 1. Suppose $\{G_t\}_{t=1}^{T}$ is a sequence of continuous and strictly increasing distribution functions (ideal forecasts). Then a forecaster who is both probabilistically and finite marginally calibrated has either issued ideal forecasts $\{G_t\}_{t=1}^{T}$ or is the finite unconditional forecaster.

The proof for the above proposition is split into two parts and is given in appendices A and B. It is trivial that the ideal forecaster satisfies the probabilistic calibration condition (2) and the finite marginal calibration condition (5). It is also trivial that the finite unconditional forecaster (FUF) satisfies the finite marginal calibration condition. To show that the FUF is probabilistically calibrated, note that for any $p \in (0,1)$, there exists $\eta$ such that $\bar{G}_T(\eta) = p \iff \bar{G}_T^{-1}(p) = \eta$. Hence

$$p = \bar{G}_T(\eta) = \frac{1}{T}\sum_{t=1}^{T} G_t(\eta) = \frac{1}{T}\sum_{t=1}^{T} G_t\{\bar{G}_T^{-1}(p)\}.$$

Including exceedance calibration in the hypotheses of the above proposition would rule out the FUF. If the FUF was exceedance calibrated, we would have

$$\frac{1}{T}\sum_{t=1}^{T} G_t^{-1}\{\bar{G}_T(x)\} = x.$$

Given an $x$, there exists a $\xi \in (0,1)$ such that $\bar{G}_T(x) = \xi \iff x = \bar{G}_T^{-1}(\xi)$. Hence, we can eliminate $x$ in the above equality to obtain

$$\frac{1}{T}\sum_{t=1}^{T} G_t^{-1}(\xi) = \bar{G}_T^{-1}(\xi).$$

Unless $G_t(x) = G(x)$ for all $t$, the above equality is a mathematical fallacy. Therefore, the FUF cannot be exceedance calibrated.

Even though this proposition does not deal with the case when $T$ approaches infinity, in all practical situations we deal with finite $T$. Indeed the graphical tests for marginal calibration discussed in Gneiting et al. (2007) deal with finite marginal calibration. Defining the average predictive distribution to be

$$\bar{F}_T(x) = \frac{1}{T}\sum_{t=1}^{T} F_t(x)$$

and the empirical CDF of the observations by

$$\hat{G}_T(x) = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}(x_t \le x),$$

where $x_t$ is the time series, Gneiting et al. (2007) propose plotting a graph of $\bar{F}_T - \hat{G}_T$ against $x$ to assess marginal calibration. Clearly this is assessing finite marginal calibration.

The implications (of the proposition) to the goal of probabilistic forecasting are 
that the level of expectation with regard to the two modes of calibration needs to be 
scaled down when the underlying model is mis-specified. This is because, without a 
correctly specified model, one cannot have both perfect probabilistic and finite marginal 
calibration unless he is the FUF. Hence the forecaster should merely aim to maximise 
sharpness subject to some level of calibration. For given levels of probabilistic and 
finite marginal calibration, a forecaster affords early warning if he is sharper than the 
unconditional distribution. 



4 Density-Forecast Estimation 

Suppose we have some data point $s_t$, at time $t$, and we want to know the future state at time $t + \tau$. We call $\tau$ the forecast lead time. In order to express uncertainty in the forecasts, we issue a density forecast. One way of obtaining a density forecast is to generate many points in the neighbourhood of $s_t$ and iterate them forward with the model to obtain an ensemble of forecasts $X^{(t)} = \{X_i^{(t)}\}_{i=1}^{N}$ at time $t + \tau$. A comprehensive review of how to generate ensemble forecasts with non-linear models is given by Leutbecher and Palmer (2008), but if we think of this process as Monte Carlo simulation, we may refer to Clements and Smith (2001) for a brief description in an econometric setting. This section is concerned with converting the ensemble into a density forecast. The Gaussian kernel function,

$$K(\xi) = \frac{1}{\sqrt{2\pi}}\exp\left(-\xi^2/2\right),$$

may be used to obtain a density function from an ensemble of point forecasts.



4.1 Single Model 

One way to convert a forecast ensemble into a density forecast would be to perform density estimation according to Parzen (1962) and Silverman (1986). The fundamental weakness of this approach is that it inherently assumes that the ensemble is a draw from the true distribution. In view of this, Roulston and Smith (2002) suggested taking into account how the model has performed in the past. A similar approach is followed by Hall and Mitchell (2007), who use past forecast errors to obtain density forecasts. Therefore, we can form density forecast estimates of the form:

$$\rho^{(t)}(x) = \frac{1}{N\sigma}\sum_{i=1}^{N} K\left\{\left(x - X_i^{(t)} - \mu\right)/\sigma\right\}, \tag{6}$$

where $\sigma$ and $\mu$ are respective kernel width and offset parameters chosen according to past performance and $K(\cdot)$ is the kernel function. The density forecast in (6) differs from the traditional Parzen (1962) estimates by the offset parameter. It is similar to the Bayesian Model Average proposed by Raftery et al. (2005) with a uniform bias correction, $\mu$, and equal weights. Selecting $\sigma$ using Silverman (1986) does not account for model mis-specification.
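The kernel-dressed density in (6) translates directly into code. The following is a minimal sketch with the Gaussian kernel; the function names and the example ensemble are hypothetical, and `sigma` and `mu` would in practice come from the fitting procedure rather than being fixed by hand.

```python
import math

def gaussian_kernel(xi):
    """K(xi) = exp(-xi^2 / 2) / sqrt(2 pi)."""
    return math.exp(-0.5 * xi * xi) / math.sqrt(2.0 * math.pi)

def dressed_density(x, ensemble, sigma, mu=0.0):
    """Equation (6): equal-weight kernels centred at each member shifted by mu."""
    total = sum(gaussian_kernel((x - member - mu) / sigma) for member in ensemble)
    return total / (len(ensemble) * sigma)

# Hypothetical three-member ensemble with kernel width 0.4 and offset 0.2.
ensemble = [1.0, 1.3, 0.8]
print(dressed_density(1.2, ensemble, sigma=0.4, mu=0.2))
```

With `mu = 0` the function reduces to an ordinary Parzen estimate; a non-zero `mu` moves every kernel by the same amount, which is the uniform bias correction mentioned above.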

To account for model mis-specification, let us first denote a record of past time series and corresponding ensemble forecasts by $V_T = \{(s_t, X^{(t)})\}_{t=1}^{T}$. Then the density forecasts whose parameters, $\mu$ and $\sigma$, are selected by taking into account past performance may be denoted by $\rho^{(t)}(x|V_T)$. While $\rho^{(t)}(x|V_T)$ has the same form as in (6), its parameters are selected by doing the minimisation

$$\min_{\sigma > 0,\, \mu}\left\{-\frac{1}{T}\sum_{t=1}^{T}\log\rho^{(t)}(s_t|V_T)\right\}. \tag{7}$$

Under certain assumptions, doing the minimisation in (7) is tantamount to minimising either the average cross entropy or the average Kullback-Leibler divergence. Without making any assumptions, the term in (7) should be called average Ignorance, $\langle\mathrm{IGN}\rangle$. Minimising (7) is equivalent to maximum likelihood under the assumption of independence of forecast errors (Raftery et al., 2005). Moreover, it is equivalent to quasi maximum likelihood (QML) under model mis-specification with independent conditional forecasts as discussed by White (1994). Interestingly, White (1982) called the QML estimator the 'minimum ignorance' estimator, arguing that it minimises our ignorance about the correct model structure.
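The minimisation in (7) can be sketched with a brute-force grid search. Everything below is illustrative: the synthetic archive (whose ensembles are deliberately centred about 0.5 below the outcomes), the parameter grids and the function names are assumptions, and a numerical optimiser would normally replace the grid.

```python
import math

def gaussian_kernel(xi):
    return math.exp(-0.5 * xi * xi) / math.sqrt(2.0 * math.pi)

def dressed_density(x, ensemble, sigma, mu):
    """Kernel-dressed ensemble density of the form (6)."""
    total = sum(gaussian_kernel((x - member - mu) / sigma) for member in ensemble)
    return total / (len(ensemble) * sigma)

def average_ignorance(archive, sigma, mu):
    """Average of -log rho^(t)(s_t) over an archive of (s_t, ensemble) pairs."""
    scores = [-math.log(dressed_density(s, ens, sigma, mu)) for s, ens in archive]
    return sum(scores) / len(scores)

def fit_by_grid(archive, sigmas, mus):
    """Brute-force stand-in for the minimisation in (7)."""
    best = min((average_ignorance(archive, s, m), s, m) for s in sigmas for m in mus)
    return best[1], best[2]

# Synthetic archive (illustrative only): ensembles sit about 0.5 below the outcome.
archive = []
for t in range(40):
    outcome = math.sin(0.3 * t)
    centre = outcome - 0.5 - 0.2 * math.cos(1.7 * t)
    archive.append((outcome, [centre - 0.1, centre, centre + 0.1]))

sigmas = [0.05 * k for k in range(1, 21)]    # 0.05, 0.10, ..., 1.00
mus = [0.1 * k for k in range(-10, 11)]      # -1.0, -0.9, ..., 1.0
sigma_hat, mu_hat = fit_by_grid(archive, sigmas, mus)

print(sigma_hat, mu_hat, average_ignorance(archive, sigma_hat, mu_hat))
```

The fitted offset should land near the bias built into the synthetic archive, which shows how past performance, rather than the ensemble spread alone, determines the parameters.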



4.2 Mixture Model 



Bröcker and Smith (2008) noted that, when doing the minimisation in (7), some of the $X^{(t)}$ may be far from the corresponding $s_t$, which could result in choices of $\sigma$ that were too big. Hence, the parameter estimates would not be robust. These shortcomings could largely be due to model mis-specification. To circumvent these, they proposed a mixture model of the unconditional density, $p_u(x)$, and $\rho^{(t)}(x|V_T)$:

$$f^{(t)}(x|V_T) = \alpha\rho^{(t)}(x|V_T) + (1 - \alpha)p_u(x), \tag{8}$$

where the mixture parameter $\alpha \in [0, 1]$. All the three parameters are fitted simultaneously by minimising average Ignorance. The unconditional density, $p_u(x)$, is estimated from data via

$$p_u(x) = \frac{1}{T\sigma_u}\sum_{t=1}^{T} K\left\{(x - s_t - \mu_u)/\sigma_u\right\}$$

and the parameters $\sigma_u$ and $\mu_u$ are then chosen to simultaneously minimise the logarithmic scoring rule as proposed in Bröcker and Smith (2008). Silverman (1986) may also be followed to estimate the unconditional density.

Note that (8) would be the linear opinion pool discussed in Clemen and Winkler (1999) and used in Hall and Mitchell (2007) if $\alpha$ was the only parameter being selected; but we also train the $\sigma$ to enhance the sharpness of the mixture distribution (see Proposition 4). The role of $\alpha$ is like that of the shrinkage in multi-parameter estimation (Efron and Morris, 1977; Hendry and Clements, 2004). If we let $r_t = \rho^{(t)}(s_t)/p_u(s_t)$, then we can state the following proposition:

PROPOSITION 2. For a given set of parameters $\mu$ and $\sigma$, the necessary and sufficient conditions for improvement from including the unconditional density in the sense of the logarithmic scoring rule are that

$$\frac{1}{T}\sum_{t=1}^{T} r_t > 1 \quad\text{and}\quad \frac{1}{T}\sum_{t=1}^{T}\frac{1}{r_t} > 1.$$

The proof for this proposition is given in appendix C. Its counterpart for point forecasts is Proposition 6 in Appendix D. The ratio $r_t$ may be interpreted as the return ratio on some invested capital in a Kelly betting scenario (Kelly, 1956) with no track take. The proposition states how the conditional and unconditional densities are to outperform each other in order for the mixture model to provide additional value.
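The two conditions can be read as statements about the slope of the average Ignorance at the end points $\alpha = 0$ and $\alpha = 1$. The sketch below takes that reading as an assumption, treats the ratios $r_t$ as given, and locates the best mixture weight by bisection; the ratio values and function names are hypothetical.

```python
def conditions_hold(ratios):
    """Both averages in the proposition exceed one."""
    T = len(ratios)
    return sum(ratios) / T > 1.0 and sum(1.0 / r for r in ratios) / T > 1.0

def score_slope(alpha, ratios):
    """d/d(alpha) of -(1/T) sum log(alpha * r_t + 1 - alpha).

    Dividing the mixture (8) by p_u(s_t) leaves alpha * r_t + 1 - alpha, so this
    is the alpha-dependent part of the average Ignorance.
    """
    return -sum((r - 1.0) / (alpha * r + 1.0 - alpha) for r in ratios) / len(ratios)

def best_alpha(ratios, tol=1e-10):
    """Minimiser on [0, 1]; the score is convex in alpha, so bisect on the slope."""
    if score_slope(0.0, ratios) >= 0.0:
        return 0.0                      # unconditional density alone is best
    if score_slope(1.0, ratios) <= 0.0:
        return 1.0                      # conditional density alone is best
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score_slope(mid, ratios) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mixed = [3.0, 0.2, 2.5, 0.4]            # each density wins on some occasions
conditional_wins = [2.0, 3.0, 1.5]      # r_t > 1 throughout
unconditional_wins = [0.5, 0.4, 0.8]    # r_t < 1 throughout

for ratios in (mixed, conditional_wins, unconditional_wins):
    print(conditions_hold(ratios), best_alpha(ratios))
```

Only the first set of ratios satisfies both conditions, and only there does the best weight fall strictly between 0 and 1, so that the mixture adds value over either density alone.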

In order to capture the effect, on kernel width, of including the unconditional density, we consider the case when $N = 1$ with $\mu = 0$. When there is no unconditional density included, minimising the logarithmic score yields

$$\sigma^2 = \frac{1}{T}\sum_{t=1}^{T}\left(s_t - X^{(t)}\right)^2. \tag{9}$$

Let us write a time series version of the logarithmic scoring rule as

$$\langle\mathrm{IGN}\rangle = -\frac{1}{T}\sum_{t=1}^{T}\log f^{(t)}(s_t|V_T). \tag{10}$$

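PROPOSITION 3. Suppose the score given by equation (10) assumes a minimum at parameter values $(\sigma_*, \alpha_*)$, then the following equation holds:

$$\sigma_*^2 = \frac{1}{T}\sum_{t=1}^{T}\left(s_t - X^{(t)}\right)^2\frac{\rho^{(t)}(s_t|V_T)}{f^{(t)}(s_t|V_T)}. \tag{11}$$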
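The following is only a sketch of why an equation of the form (11) arises; the proof in the appendix remains the authoritative argument. It assumes the Gaussian kernel with $N = 1$ and $\mu = 0$, an interior optimum $0 < \alpha_* < 1$, and uses the shorthand $e_t = s_t - X^{(t)}$, $\rho_t = \rho^{(t)}(s_t|V_T)$ and $f_t = f^{(t)}(s_t|V_T)$, which is introduced here for brevity.

```latex
\begin{align*}
\frac{\partial\langle\mathrm{IGN}\rangle}{\partial\alpha} = 0
  &\;\Rightarrow\; \sum_{t=1}^{T}\frac{\rho_t - p_u(s_t)}{f_t} = 0
   \;\Rightarrow\; \frac{1}{T}\sum_{t=1}^{T}\frac{\rho_t}{f_t} = 1
   \quad\text{(since } \alpha\rho_t + (1-\alpha)p_u(s_t) = f_t\text{)},\\
\frac{\partial\langle\mathrm{IGN}\rangle}{\partial\sigma} = 0
  &\;\Rightarrow\; \sum_{t=1}^{T}\frac{\alpha\rho_t}{f_t}
     \left(\frac{e_t^2}{\sigma^3} - \frac{1}{\sigma}\right) = 0
   \;\Rightarrow\; \sigma_*^2\sum_{t=1}^{T}\frac{\rho_t}{f_t}
     = \sum_{t=1}^{T} e_t^2\,\frac{\rho_t}{f_t},\\
  &\;\Rightarrow\; \sigma_*^2 = \frac{1}{T}\sum_{t=1}^{T} e_t^2\,\frac{\rho_t}{f_t}.
\end{align*}
```

Setting $\rho_t/f_t = 1$ for every $t$, which is the case $\alpha = 1$, recovers the unweighted average in (9).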



See Appendix C for the proof. Corollary 1 in Appendix D gives corresponding sharpness conditions for point forecasts. For illustrative purposes, suppose that the $k$th forecast is far from the corresponding observation in the sense that

$$\left|s_k - X^{(k)}\right| > \max_{t \neq k}\left\{\left|s_t - X^{(t)}\right|\right\}.$$

As a result, the kernel width in (9) would be inflated. Equation (11) provides a way to discount the contributions of a few bad forecasts to the kernel width. In this case, $(\sigma_*, \alpha_*)$ would be chosen such that

$$\frac{p^{(k)}(s_k \mid V_T)}{f^{(k)}(s_k \mid V_T)} < 1.$$

This is especially valuable when $T$ is small, which is the case in typical time series. The idea is that a reduction in kernel width is necessary for the entropy of $f^{(t)}(x \mid V_T)$ to decrease even when the ensemble has more than one member, but it is easier to explain how the reduction is achieved for a single-member ensemble. Despite this reduction, some mixture forecasts may still be less sharp than the unconditional distribution in the sense of entropy. A straightforward application of the Kullback-Leibler (KL) and Jensen's inequalities leads to the relations

$$\alpha H\{p^{(t)}\} + (1-\alpha)H\{p_u\} \le H\{f^{(t)}\} \le \alpha^2 H\{p^{(t)}\} + \alpha(1-\alpha)H\{p^{(t)}, p_u\} + (1-\alpha)\alpha H\{p_u, p^{(t)}\} + (1-\alpha)^2 H\{p_u\}, \tag{12}$$

where $H\{f\} = -\int f(x)\log f(x)\,dx$ and $H\{f, g\} = -\int f(x)\log g(x)\,dx$ are the entropy and cross entropy respectively. Therefore, the necessary and sufficient conditions for $H\{f^{(t)}\} > H\{p_u\}$ to hold are that $H\{p_u\} < H\{p^{(t)}, p_u\}$ and $H\{p_u\} < H\{p^{(t)}\}$ respectively. The first inequality of (12) can be used to establish the following proposition:



PROPOSITION 4. If $H\{p^{(t)}\} \le H\{p_u\}$, then $H\{p^{(t)}\} \le H\{f^{(t)}\}$, i.e. in the sense of entropy, merely mixing $p^{(t)}(x)$ with the unconditional density without re-adjusting the $\sigma$ parameter cannot improve the sharpness of the predictive distribution.
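The entropy relations (12) and Proposition 4 can be checked numerically. The sketch below does so for two Gaussian densities using the trapezoidal rule; the choice of densities, the mixture weight of 0.6 and the function names are illustrative assumptions only.

```python
import math

ALPHA = 0.6  # hypothetical mixture weight

def log_normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))

def log_p(x):   # toy conditional density, N(0.5, 0.4^2)
    return log_normal_pdf(x, 0.5, 0.4)

def log_u(x):   # toy unconditional density, N(0, 1)
    return log_normal_pdf(x, 0.0, 1.0)

def mix(x):     # mixture density f = alpha*p + (1 - alpha)*p_u
    return ALPHA * math.exp(log_p(x)) + (1.0 - ALPHA) * math.exp(log_u(x))

def integrate(func, lo=-8.0, hi=8.0, n=16000):
    """Composite trapezoidal rule on [lo, hi] with n panels."""
    h = (hi - lo) / n
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, n):
        total += func(lo + i * h)
    return total * h

H_p = integrate(lambda x: -math.exp(log_p(x)) * log_p(x))
H_u = integrate(lambda x: -math.exp(log_u(x)) * log_u(x))
H_f = integrate(lambda x: -mix(x) * math.log(mix(x)))
H_pu = integrate(lambda x: -math.exp(log_p(x)) * log_u(x))  # cross entropy H{p, p_u}
H_up = integrate(lambda x: -math.exp(log_u(x)) * log_p(x))  # cross entropy H{p_u, p}

lower = ALPHA * H_p + (1.0 - ALPHA) * H_u
upper = (ALPHA ** 2 * H_p + ALPHA * (1.0 - ALPHA) * (H_pu + H_up)
         + (1.0 - ALPHA) ** 2 * H_u)
print(lower, H_f, upper)
```

Here the conditional density is sharper than the unconditional one, and the mixture entropy lands between the two bounds and above the entropy of the conditional density, as Proposition 4 requires.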

It is not obvious what the effect of the mixture is on calibration, except that including the unconditional density improves the KL distance from the ideal forecasts. Nevertheless, the mixture parameter that minimises the logarithmic score yields the equation

$$\frac{1}{T}\sum_{t=1}^{T}\frac{p_u(s_t)}{f^{(t)}(s_t \mid V_T)} = 1.$$
On the other hand, we note that equation (2) is equivalent to

$$\frac{1}{T}\sum_{t=1}^{T}\frac{g_t(s_t)}{f_t(s_t)} = 1.$$

The two preceding equations have the same form, with $p_u$ in the first taking the place of $g_t$ in the second. What happens to calibration due to the mixture will be explored by way of example in the next section.



5 Applications 



This section presents results that highlight the effects, on sharpness, calibration and the time horizon over which density forecasts are useful, of introducing the unconditional density to form the density forecasts. As an example, the Bank of England (BOE) inflation forecasts are considered. Every quarter, the BOE issues GDP and inflation quarterly forecasts for up to twelve quarters into the future (Harrison et al., 2005). The forecasts come in pairs: those based on constant interest rates and those based on market interest rates. We shall only consider those based on constant interest rates. The model is nonlinear and is an addition to a suite of models which are admittedly imperfect since they are simplifications of reality (Harrison et al., 2005; Kapetanios et al., 2008).

We shall here consider the Retail Price Index inflation excluding mortgage interest payments (RPIX inflation rate). The corresponding forecasts are published on the BOE website for the period from 1993 to 2005. The start of this period coincided with the BOE starting to issue an inflation target of 2.5%. Before that time no inflation targets were set. RPIX inflation data are published on the Office for National Statistics website.

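At a given lead time, each forecast comprises the parameters mode, mean, median, uncertainty and the skew parameter. A two-piece normal distribution is then used to produce fan charts (Britton et al., 1998), which are then published on the BOE website. Only the parameters mode (or central projection), $\mu_t$, uncertainty, $\sigma_t$, and skewness, $\gamma_t$, are required to completely specify the two-piece distribution. In a nutshell, the probability distribution issued is (Britton et al., 1998; Gneiting, 2011)

$$p_t(x; \sigma_{1,t}, \sigma_{2,t}) = \frac{\sqrt{2/\pi}}{\sigma_{1,t} + \sigma_{2,t}} \begin{cases} \exp\left[-\dfrac{(x - \mu_t)^2}{2\sigma_{1,t}^2}\right], & x \le \mu_t, \\ \exp\left[-\dfrac{(x - \mu_t)^2}{2\sigma_{2,t}^2}\right], & x > \mu_t, \end{cases} \tag{14}$$

where

$$\sigma_{1,t} = \sigma_t/\sqrt{1 + \gamma_t}, \qquad \sigma_{2,t} = \sigma_t/\sqrt{1 - \gamma_t}.$$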



One may think of the central projection as a one-member ensemble (or Monte Carlo simulation). The entropy of this density function is given by

$$H(p_t) = \log\left\{\sqrt{\pi/2}\,(\sigma_{1,t} + \sigma_{2,t})\right\} + \frac{1}{2}.$$
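The closed-form entropy can be checked against direct numerical integration of the two-piece normal density. The sketch below does this with the trapezoidal rule; the parameter values and function names are hypothetical and chosen only for illustration.

```python
import math

def two_piece_sigmas(sigma, gamma):
    """Left and right scale parameters of the two-piece normal."""
    return sigma / math.sqrt(1.0 + gamma), sigma / math.sqrt(1.0 - gamma)

def two_piece_log_pdf(x, mode, sigma, gamma):
    s1, s2 = two_piece_sigmas(sigma, gamma)
    s = s1 if x <= mode else s2
    return (0.5 * math.log(2.0 / math.pi) - math.log(s1 + s2)
            - 0.5 * ((x - mode) / s) ** 2)

def two_piece_entropy(sigma, gamma):
    """Closed form: log{sqrt(pi/2) (sigma_1 + sigma_2)} + 1/2."""
    s1, s2 = two_piece_sigmas(sigma, gamma)
    return math.log(math.sqrt(math.pi / 2.0) * (s1 + s2)) + 0.5

def integrate(func, lo, hi, n=20000):
    """Composite trapezoidal rule on [lo, hi] with n panels."""
    h = (hi - lo) / n
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, n):
        total += func(lo + i * h)
    return total * h

# Hypothetical forecast parameters: mode, uncertainty, skewness.
mode, sigma, gamma = 2.5, 0.6, 0.3
lo, hi = mode - 8.0, mode + 8.0

def pdf(x):
    return math.exp(two_piece_log_pdf(x, mode, sigma, gamma))

mass = integrate(pdf, lo, hi)
H_num = integrate(lambda x: -pdf(x) * two_piece_log_pdf(x, mode, sigma, gamma), lo, hi)
print(mass, H_num, two_piece_entropy(sigma, gamma))
```

When the skewness is zero the two scales coincide and the formula reduces to the familiar Gaussian entropy $\frac{1}{2}\log(2\pi e\sigma^2)$.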



Dowd highlighted that the BOE RPIX inflation forecasts are very pessimistic by assessing the corresponding fan charts. He argued that the BOE over-estimated the probability that a given target range would be breached. In this paper, we demonstrate that the distributions can be made sharper by mixing them with the unconditional density. To this end, the new forecast density shall be given by

$$f_t(x) = \alpha\, p_t(x; \lambda\sigma_{1,t}, \lambda\sigma_{2,t}) + (1 - \alpha)\, p_u(x). \tag{15}$$



The parameters $\lambda$ and $\alpha$ are selected by minimising Ignorance over a forecast-verification archive, and they depend on lead time. The use of $\lambda$ in (15) accommodates the BOE's judgement and allows the mixture distributions to be sharper than the BOE forecasts. Forecast combination as done in Hall and Mitchell (2007) corresponds to setting $\lambda = 1$ and then finding an 'optimal' $\alpha$. According to Proposition 4, this cannot yield sharper forecasts. The case $\lambda = 1$ and $\alpha = 1$ corresponds to the BOE forecasts and, in the sequel, this situation shall simply be referred to as $\alpha = 1$.

In order to estimate the unconditional density, we need to make sure that inflation exhibited stationarity within the epoch under investigation. The works of Boero et al. (2008) and Benati (2004) argue that the inflation targeting policy of 1992 introduced a break in the dynamics of inflation and marked the beginning of a period of remarkable stability. Using unit-root tests, they found that giving independence to the BOE in 1997 introduced no break in the dynamics of RPIX inflation. In light of this, we estimated the unconditional density using data for the period from 1992 to 2004.

For each forecast lead time considered, there were 44 forecasts. Since this is a small data set, we use a leave-one-out cross-validation approach to quantify the effect of including the unconditional density on the sharpness and calibration of the density forecasts. This is done by successively leaving out one forecast from the training set. Each 43-member training set is then used to estimate values of $\lambda$ and $\alpha$ that minimise Ignorance. These values are then used on each excluded forecast to produce the mixture density forecast. Entropies and PITs of these out-of-sample distributions are then compared with those of the BOE forecasts. We also compare the out-of-sample forecaster's unconditional density with the BOE unconditional density via the Hellinger distance.

Figure 1: Graphs of density forecasts of RPIX inflation as issued by the Bank of England (left) and when the unconditional density is included (right). Both panels are labelled 'UK RPIX Inflation'.
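The leave-one-out procedure described above can be sketched as follows. The grid search over $\lambda$ and $\alpha$, the Gaussian stand-in for the unconditional density, the tiny synthetic archive and all names are assumptions made for illustration; the optimiser and the density estimate actually used for the BOE forecasts are not specified here.

```python
import math

def two_piece_pdf(x, mode, sigma, gamma, lam=1.0):
    """Two-piece normal density with both scales multiplied by lam."""
    s1 = lam * sigma / math.sqrt(1.0 + gamma)
    s2 = lam * sigma / math.sqrt(1.0 - gamma)
    s = s1 if x <= mode else s2
    return (math.sqrt(2.0 / math.pi) / (s1 + s2)
            * math.exp(-0.5 * ((x - mode) / s) ** 2))

def mixture_pdf(x, forecast, lam, alpha, p_u):
    """Mixture of the rescaled forecast density and the unconditional density."""
    mode, sigma, gamma = forecast
    return alpha * two_piece_pdf(x, mode, sigma, gamma, lam) + (1.0 - alpha) * p_u(x)

def mean_ignorance(archive, lam, alpha, p_u):
    # A tiny floor avoids log(0) if a density underflows.
    return -sum(math.log(max(mixture_pdf(obs, fc, lam, alpha, p_u), 1e-300))
                for fc, obs in archive) / len(archive)

def fit(archive, p_u, lams, alphas):
    """Grid search for the (lam, alpha) pair with the lowest mean Ignorance."""
    grid = [(lam, a) for lam in lams for a in alphas]
    return min(grid, key=lambda pair: mean_ignorance(archive, pair[0], pair[1], p_u))

def leave_one_out(archive, p_u, lams, alphas):
    """For each forecast, fit on the others and score the excluded one."""
    out = []
    for k, (fc, obs) in enumerate(archive):
        train = archive[:k] + archive[k + 1:]
        lam, a = fit(train, p_u, lams, alphas)
        score = -math.log(max(mixture_pdf(obs, fc, lam, a, p_u), 1e-300))
        out.append((lam, a, score))
    return out

def p_u(x):
    """Hypothetical unconditional density, N(2.5, 0.7^2)."""
    return math.exp(-0.5 * ((x - 2.5) / 0.7) ** 2) / (0.7 * math.sqrt(2.0 * math.pi))

# Synthetic archive of ((mode, uncertainty, skewness), observation) pairs.
archive = [
    ((1.6, 0.6, 0.0), 1.7), ((3.2, 0.6, 0.2), 3.1), ((2.1, 0.6, -0.1), 2.25),
    ((2.9, 0.6, 0.0), 2.7), ((2.5, 0.6, 0.1), 2.55), ((1.9, 0.6, 0.0), 2.0),
    ((3.4, 0.6, -0.2), 3.25), ((2.4, 0.6, 0.0), 3.3),
]
lams = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
alphas = [i / 10 for i in range(11)]

results = leave_one_out(archive, p_u, lams, alphas)
for lam, a, score in results:
    print(lam, a, round(score, 3))
```

Because the grid contains the pair $\lambda = 1$, $\alpha = 1$, the fitted mixture can never score worse in sample than the unadjusted forecasts; whether it also does better out of sample is what the excluded forecasts test.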

A pair of density forecasts is shown in Figure 1 to illustrate the effect on sharpness of mixing BOE forecasts with the unconditional density according to (15). Visual inspection suggests an increase in sharpness due to mixing with the unconditional density. Using entropy to measure sharpness, the forecasts were compared with the unconditional distribution. A graph of the percentage of forecasts sharper than the unconditional distribution against lead time is shown on the left of Figure 2. From the graph it is evident that the predictive distributions issued by the BOE are all less sharp than the unconditional distribution at lead times of four quarters and beyond. This undermines the value of BOE fan charts from one year ahead onwards, in favour of the unconditional forecaster. At a lead time of one quarter, it appears that mixing with the unconditional density results in mixture distributions that are less sharp than the BOE forecasts. The greater sharpness of the BOE forecasts, however, comes at the expense of marginal calibration, as is evident in the right-hand graphs of Figure 2.

The graphs of the Hellinger distance of the forecaster's unconditional densities from the unconditional density in Figure 2 highlight the effect, on marginal calibration, of mixing with the unconditional density. At all the lead times considered, it is clear that mixing with the unconditional density out-performed the BOE forecasts. Moreover, there is a gain with respect to marginal calibration. Concerning probabilistic calibration, we consider only the lead times of two and three quarters ahead. The corresponding PIT graphs are shown in Figure 3. Visual inspection suggests that mixing with the unconditional density does not have a significant effect on the quality of the PITs.
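For concreteness, the Hellinger distance between the forecaster's unconditional density (the average of the issued densities) and the unconditional density could be computed as in the sketch below. The normalisation used, $\sqrt{1 - \int\sqrt{pq}\,dx}$, is one common convention and is assumed here; the Gaussian densities and function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def average_density(densities):
    """Forecaster's unconditional density: the average of the issued densities."""
    return lambda x: sum(d(x) for d in densities) / len(densities)

def hellinger(p, q, lo, hi, n=20000):
    """Hellinger distance sqrt(1 - integral of sqrt(p q)), trapezoidal rule."""
    h = (hi - lo) / n
    bc = 0.5 * (math.sqrt(p(lo) * q(lo)) + math.sqrt(p(hi) * q(hi)))
    for i in range(1, n):
        x = lo + i * h
        bc += math.sqrt(p(x) * q(x))
    # Guard against a slightly negative argument caused by rounding.
    return math.sqrt(max(0.0, 1.0 - bc * h))

# Hypothetical issued densities and unconditional density.
issued = [lambda x, m=m: normal_pdf(x, m, 0.5) for m in (1.8, 2.3, 2.6, 3.1)]
forecaster_unconditional = average_density(issued)

def unconditional(x):
    return normal_pdf(x, 2.5, 0.7)

distance = hellinger(forecaster_unconditional, unconditional, -4.0, 9.0)
print(distance)
```

Under this convention the distance lies between 0 (identical densities) and 1 (densities with disjoint support), so smaller values indicate better marginal calibration.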
Figure 2: (left) Graphs of the percentage of predictive distributions sharper than the unconditional distribution at various lead times and (right) graphs of the Hellinger distance of the forecaster's unconditional densities from the unconditional distribution versus lead time.

Figure 3: (left) Distributions of PITs for RPIX inflation forecasts at lead times of 2 and 3 quarters ahead.

6 Discussion and Conclusions



This paper presented a new theoretical and empirical analysis of the quality of density forecasts in terms of sharpness and calibration. It revisited the conjecture of Gneiting et al. (2007) that a sufficiently calibrated forecaster is no sharper than the ideal forecaster and proved a relevant proposition. It turned out that one cannot have both probabilistic and marginal calibration hold when the underlying model is mis-specified unless one settles for the unconditional distribution. Therefore, the paper argued for scaling down calibration expectations when facing model mis-specification. It focused upon combining conditional forecasts with the unconditional density.

It was found that including the unconditional density via the logarithmic scoring 
rule tended to improve marginal calibration and maintain probabilistic calibration. This 
could be accompanied by a corresponding increase in sharpness as measured by entropy. 
Improvement in marginal calibration increased with lead time with no obvious compro- 
mise to probabilistic calibration. Fairly calibrated predictive distributions at higher lead 
times were found to be generally sharper than the unconditional distribution, thus af- 
fording early warning. Crucially, though, some of the density forecasts may have larger 
entropy than the unconditional distribution. Such forecasts may have to be rejected in 
favour of the unconditional distribution, which is sharper. These observations were made 
on RPIX inflation forecasts issued by the BOE. The forecasting model was nonlinear 
and mis-specified, having some stochastic component. 

Relative to the unconditional distribution of RPIX inflation data, we found the BOE 
density forecasts to be very pessimistic, especially from lead times of four quarters and 
above. At lead times above three quarters, the unconditional distribution was found to 
be sharper than all the BOE forecasts. This undermines the value of the BOE forecasts 
at these lead times. Taking into account the BOE's judgement, mixing conditional 
forecasts with the unconditional density was found to yield sharper forecasts than the 
unconditional distribution. For instance, at a lead time of three quarters, only 30% of 
the BOE density forecasts were sharper than the unconditional distribution. Mixing 
with the unconditional density improved sharpness of the forecasts so that about 75% 
of them became sharper than the unconditional distribution. 

It is useful to note that our calculations based on the BOE forecasts were performed on only 44 data points. This indicates that the methodology is relevant to data-poor situations. If the system were not stationary, one would need to transform the time series to ensure stationarity. The methodology is applicable to both linear and nonlinear stochastic systems provided stationarity has been established. Nonlinearity abounds in finance, with examples in stock markets (Kanas, 2003; Linden et al., 1993). In macroeconomics, GDP forecasting is an immediate example that would also benefit from this study.
A Generalised Construction of Probabilistically Calibrated Forecasts

PROPOSITION 5. Suppose that $G_t$ is a continuous strictly increasing distribution function on an interval $I_t$. Let $I$ be any interval and choose for each $t$ a strictly increasing continuous map $h_t : I \to I_t$. A probabilistically calibrated forecast distribution function precisely takes the form

$$F_s(x_s) = \frac{1}{T}\sum_{t=1}^{T} G_t\left[h_t\{h_s^{-1}(x_s)\}\right]. \tag{A.1}$$



Equation (A.1) is just the construction of Gneiting et al. (2007) in § 2.4, except that $T$ is general rather than 2 and the linear maps $x$ and $x/a$ that Gneiting et al. (2007) use are replaced by the nonlinear maps $h_t$. Note that each $F_s$ is a strictly increasing continuous distribution on $I_s$ and they are probabilistically calibrated forecasts of the $G_t$'s, because given $0 < p < 1$ there is some $x$ in $I$ with

$$p = \frac{1}{T}\sum_{t=1}^{T} G_t\{h_t(x)\},$$

whence $F_t^{-1}(p) = h_t(x)$ and

$$\frac{1}{T}\sum_{t=1}^{T} G_t\{F_t^{-1}(p)\} = \frac{1}{T}\sum_{t=1}^{T} G_t\{h_t(x)\} = p.$$



Moreover, any probabilistically calibrated forecast of $G_t$ takes exactly this form. To see this, let $I$ be any interval and $h_1$ be any suitable map from $I$ onto $I_1$ and then define $h_t(x) = F_t^{-1}[F_1\{h_1(x)\}]$. It then follows that

$$\frac{1}{T}\sum_{t=1}^{T} G_t\left[h_t\{h_s^{-1}(x_s)\}\right] = \frac{1}{T}\sum_{t=1}^{T} G_t\left[F_t^{-1}\{F_s(x_s)\}\right] = F_s(x_s).$$

The first equality follows by definition of the $h_t$ functions and the next by the probabilistic calibration property. Hence the $F_t$'s have exactly the form of the construction.
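The construction (A.1) can be verified numerically on a small example. In the sketch below the $G_t$ are unit-variance normal distributions and the $h_t$ are increasing linear maps on the real line; both choices, and all names, are assumptions made purely for illustration.

```python
import math

def normal_cdf(x, mean, sd=1.0):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

means = [-1.0, 0.0, 2.0]                      # G_t = N(means[t], 1), toy choice
maps = [(1.0, 0.0), (0.5, 1.0), (2.0, -0.5)]  # h_t(x) = a*x + b with a > 0, toy choice
T = len(means)

def h(t, x):
    a, b = maps[t]
    return a * x + b

def h_inv(t, y):
    a, b = maps[t]
    return (y - b) / a

def F(s, x):
    """Forecast distribution function from equation (A.1)."""
    u = h_inv(s, x)
    return sum(normal_cdf(h(t, u), means[t]) for t in range(T)) / T

def F_inv(s, p, lo=-100.0, hi=100.0):
    """Invert the strictly increasing function F(s, .) by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if F(s, mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def calibration_average(p):
    """(1/T) sum_t G_t(F_t^{-1}(p)), which equals p under probabilistic calibration."""
    return sum(normal_cdf(F_inv(t, p), means[t]) for t in range(T)) / T

for p in (0.1, 0.25, 0.5, 0.9):
    print(p, calibration_average(p))
```

The printed averages reproduce $p$ up to the tolerance of the bisection, even though none of the individual $F_t$ coincides with its $G_t$.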



B Proof of Proposition 1

Using Proposition 5, probabilistic calibration implies that $F_t$ takes precisely the form given in (A.1). If the sequence $\{F_t\}$ is also finite marginally calibrated, we can substitute (A.1) into (5) to obtain

$$\frac{1}{T(T-1)}\sum_{t=1}^{T}\sum_{s=1,\, s \neq t}^{T} G_t\left[h_t\{h_s^{-1}(x)\}\right] = \frac{1}{T}\sum_{t=1}^{T} G_t(x), \quad T \ge 2.$$

It is, therefore, required that $G_t[h_t\{h_s^{-1}(x)\}] = G_i(x)$, where $i \in \{1, \ldots, T\}$. If $G_t[h_t\{h_s^{-1}(x)\}] = G_s(x)$ for any $s$, then the forecasts $\{F_t\}$ are ideal. On the other hand, if $G_t[h_t\{h_s^{-1}(x)\}] = G_t(x)$, then we have the finite unconditional forecaster (FUF).






We now wish to show that a non-FUF forecaster who is both probabilistically and marginally calibrated is precisely the ideal forecaster. Consider $F_s(x)$ as defined by equation (A.1) for a given $s$. Suppose there exists $q$ such that

$$G_t\left[h_t\{h_s^{-1}(x)\}\right] = G_s(x), \quad \text{for all } t \le q \tag{B.1}$$

and

$$G_t\left[h_t\{h_s^{-1}(x)\}\right] = G_t(x), \quad \text{for all } t > q. \tag{B.2}$$

Equation (B.1) implies that $G_s[h_s\{h_t^{-1}(x)\}] = G_t(x)$ for all $t \le q$ while (B.2) implies that $h_t(x) = h_s(x)$ for all $t > q$. $F_s(x)$ contains $q$ counts of $G_s(x)$. Each

$$F_i(x) = \frac{1}{T}\sum_{t=1}^{T} G_t\left[h_t\{h_i^{-1}(x)\}\right],$$

$i \neq s$, contains 0 counts of $G_s(x)$ if $i \le q$. If $i > q$, we get

$$G_t\left[h_t\{h_i^{-1}(x)\}\right] = G_t\left[h_t\{h_s^{-1}(x)\}\right] = G_s(x),$$

for all $t \le q$. The first equality follows from noting that $h_i(x) = h_s(x)$ and the second from applying (B.1). Hence each such $F_i(x)$ contains $q$ counts of $G_s(x)$. Therefore, all the summations on the right hand side of the forecasters contain $q + (T - q)q$ counts of $G_s(x)$. Finite marginal calibration imposes the requirement that $q + (T - q)q = T$, which holds if and only if $q = T$. But $q = T$ implies that we have ideal forecasts.

More generally, the sequence $\{G_t[h_t\{h_s^{-1}(x)\}]\}_{t > q}$ may contain multiplicities of the $G_t(x)$ terms. This means that, for a given $t = r > q$ for which $G_r[h_r\{h_s^{-1}(x)\}] = G_r(x)$, there may be at least another $p \neq r$ and $p > q$ such that $G_p[h_p\{h_s^{-1}(x)\}] = G_r(x)$. Let $j$ be the number of all $p$'s for all $r$'s as defined above. Then the total number of $G_s(x)$ terms over the right hand sides of all $F_i(x)$ and $F_s(x)$ is $q + (T - q - j)q$. Marginal calibration imposes the condition that

$$q + (T - q - j)q = T \;\Longrightarrow\; q^2 - q(T - j + 1) + T = 0.$$

For the above quadratic equation to have an integer solution in $q$, the discriminant must be a perfect square, which happens if and only if $j = 0$. Hence a non-FUF forecaster who is both finite marginally and probabilistically calibrated must have issued ideal forecasts.



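C Proofs of Propositions 2 and 3

Proof of Proposition 2. The second partial derivative of equation (10) with respect to the mixture parameter $\alpha$ yields

$$\frac{\partial^2 \langle \mathrm{IGN} \rangle}{\partial \alpha^2} = \frac{1}{T}\sum_{t=1}^{T}\left[\frac{p^{(t)}(s_t \mid V_T) - p_u(s_t)}{f^{(t)}(s_t \mid V_T)}\right]^2 \ge 0.$$

Hence the first derivative of $\langle \mathrm{IGN} \rangle$ with respect to $\alpha$ is an increasing function of $\alpha$. It follows that the first derivative will have a zero at some $\alpha = \alpha_* \in (0, 1)$ if and only if

$$\left.\frac{\partial \langle \mathrm{IGN} \rangle}{\partial \alpha}\right|_{\alpha = 0} < 0 \quad \text{and} \quad \left.\frac{\partial \langle \mathrm{IGN} \rangle}{\partial \alpha}\right|_{\alpha = 1} > 0.$$

These are essentially the inequalities in the proposition. The second derivative implies that $\alpha_*$ is a global minimiser of the score.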

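Proof of Proposition 3. $\partial \langle \mathrm{IGN} \rangle / \partial \sigma = 0$ implies that

$$\sigma^2\, \frac{1}{T}\sum_{t=1}^{T}\frac{p^{(t)}(s_t \mid V_T)}{f^{(t)}(s_t \mid V_T)} = \frac{1}{T}\sum_{t=1}^{T}\left(s_t - X^{(t)}\right)^2 \frac{p^{(t)}(s_t \mid V_T)}{f^{(t)}(s_t \mid V_T)}.$$

But $\partial \langle \mathrm{IGN} \rangle / \partial \alpha = 0$ implies that

$$\frac{1}{T}\sum_{t=1}^{T}\frac{p^{(t)}(s_t \mid V_T)}{f^{(t)}(s_t \mid V_T)} = 1,$$

which may be plugged into the left hand side of the previous equation to complete the proof.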



D Combining point forecasts

Consider two forecasting models with standard deviations of errors given by $\sigma_1$ and $\sigma_2$, respectively. Furthermore, suppose the correlation coefficient of the forecasting errors of these models is $\rho$. We assume that the forecasting errors of each model are not biased; otherwise the modeller can always correct the bias. If we make a forecast combination, $y_c = \alpha y_1 + (1 - \alpha) y_2$, of models 1 and 2, then the following proposition, which is a counterpart of Proposition 2, holds:

PROPOSITION 6. Suppose that the standard deviations of forecasting errors of two models are $\sigma_1, \sigma_2 \neq 0$, respectively. If the forecast errors are not biased, then the necessary and sufficient condition for improvement of the combined forecast in the sense of mean squared errors is

$$\rho < \min\left\{\frac{\sigma_1}{\sigma_2}, \frac{\sigma_2}{\sigma_1}\right\}, \tag{D.1}$$

where $\rho$ is the correlation coefficient of the forecasting errors.

Proof: Suppose the forecast errors of the two models are $e_1$ and $e_2$, respectively. Then the forecast error of the combined model is $e_c = \alpha e_1 + (1 - \alpha) e_2$. Since we assumed the errors are not biased, it follows that $E[e_i] = 0$, $i = 1, 2$. Therefore, $\sigma_i^2 = E[e_i^2]$, $i = 1, 2$. The mean squared error of the combined forecast then satisfies the relation $\sigma_c^2 = E[e_c^2]$, whence

$$\sigma_c^2 = \alpha^2\sigma_1^2 + (1 - \alpha)^2\sigma_2^2 + 2\rho\alpha(1 - \alpha)\sigma_1\sigma_2. \tag{D.2}$$

Differentiating (D.2) with respect to $\alpha$ yields

$$\frac{d\sigma_c^2}{d\alpha} = 2\alpha(\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2) - 2(\sigma_2^2 - \rho\sigma_1\sigma_2). \tag{D.3}$$

At the extremum of the variance $\sigma_c^2$, $d\sigma_c^2/d\alpha = 0$, which yields

$$\alpha^* = \frac{\sigma_2^2 - \rho\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2}. \tag{D.4}$$

In order to deduce that $\alpha^*$ is a global minimum, it suffices to note that

$$\frac{d^2\sigma_c^2}{d\alpha^2} = 2(\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2) > 0.$$

Condition (D.1) follows upon imposing the requirement that $0 < \alpha^* < 1$.

Note that if either variance of the forecast error vanishes, there is no need for model combination. The above proposition together with its proof leads us to the following corollary:

Corollary 1. If condition (D.1) holds, then the variance of the forecast errors of the combined model, $\sigma_c^2$, at $\alpha^*$ is smaller than that of either constituent model, i.e.

$$\sigma_c^2 < \min\{\sigma_1^2, \sigma_2^2\}. \tag{D.5}$$

The above corollary may be viewed as a point forecast parallel of Proposition 3. In some sense it guarantees the 'sharpness' of the combined forecast. Unfortunately, here sharpness has to be thought of as a property of both forecasts and observations.
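Equations (D.1), (D.2) and (D.4) are simple enough to evaluate directly. The following sketch does so for two illustrative cases; the numerical values and function names are hypothetical.

```python
def combined_variance(alpha, s1, s2, rho):
    """Error variance of the combined forecast, equation (D.2)."""
    return (alpha ** 2 * s1 ** 2 + (1.0 - alpha) ** 2 * s2 ** 2
            + 2.0 * rho * alpha * (1.0 - alpha) * s1 * s2)

def optimal_weight(s1, s2, rho):
    """Stationary point of the combined variance, equation (D.4)."""
    return (s2 ** 2 - rho * s1 * s2) / (s1 ** 2 + s2 ** 2 - 2.0 * rho * s1 * s2)

def condition_d1(s1, s2, rho):
    """Condition (D.1): rho < min(s1/s2, s2/s1)."""
    return rho < min(s1 / s2, s2 / s1)

# Case 1 (hypothetical): weakly correlated errors, condition (D.1) holds.
s1, s2, rho = 1.0, 2.0, 0.3
a_star = optimal_weight(s1, s2, rho)
var_star = combined_variance(a_star, s1, s2, rho)
print(condition_d1(s1, s2, rho), a_star, var_star, min(s1 ** 2, s2 ** 2))

# Case 2 (hypothetical): strongly correlated errors, condition (D.1) fails.
a_out = optimal_weight(1.0, 2.0, 0.8)
print(condition_d1(1.0, 2.0, 0.8), a_out)
```

In the first case the optimal weight lies strictly between 0 and 1 and the combined variance falls below that of the better model, in line with Corollary 1. In the second case the stationary point lies outside the unit interval, so no convex combination improves on the better model.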